[configure] Backend Performance Requirements for etcd by jing2uo · Pull Request #148 · alauda/knowledge

jing2uo · 2026-04-22T07:08:14Z

Adds a new ACP KB article, filed under the configure area.

✅ Automated verification passed: 3 / 3 verification steps ran successfully on a real Kubernetes cluster using the article's commands (2026-04-22T13:12:08Z).

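Suggested reviewers for the `configure` area

Picked automatically from the people active in this area according to kb/OWNERS.md + kb/KB_REVIEWERS.md; if you were @-mentioned by mistake, please ignore this.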

@changluyi @zhangzujian @oilbeater

Contributors without a GitHub handle (please ping them manually for matters in this area):

cong <q1875486458@gmail.com>
MeoK <zhouxuan0922@outlook.com>

coderabbitai · 2026-04-22T07:08:21Z

Walkthrough

A new troubleshooting documentation page was added for etcd backend performance degradation, providing issue characterization, root-cause explanation, and resolution guidance with diagnostic procedures, commands, and monitoring thresholds.

Changes

| Cohort / File(s) | Summary |
|---|---|
| etcd Performance Troubleshooting Documentation `docs/en/solutions/Backend_Performance_Requirements_for_etcd.md` | New documentation page describing etcd backend performance degradation: issue symptoms with log message examples, root-cause analysis linking backend bottlenecks to missed heartbeats and slow requests, and resolution procedures including fio disk I/O benchmarking (p99 fdatasync latency target), Prometheus metric monitoring with p99 thresholds, network health checks (RTT/packet loss), optional database defragmentation via etcdctl, and diagnostic commands using kubectl and curl. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 A doc for etcd's troubles is here,
Performance tips crystal and clear,
With benchmarks and thresholds to check,
No heartbeats shall go to heck! 💫

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly summarizes the main change: adding documentation about backend performance requirements for etcd, which matches the new file added and the PR's explicit objective. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/en/solutions/Backend_Performance_Requirements_for_etcd.md`:
- Around line 64-72: Add an explicit safety note to the etcd defragmentation
snippet instructing operators to run etcdctl defrag on one etcd member at a time
(sequentially, not concurrently) using the existing kubectl exec ...
etcd-<node-name> -- etcdctl defrag command; update the paragraph around the
command (referencing the "etcdctl defrag" and "kubectl exec -n kube-system
etcd-<node-name>" text) to state clearly to perform defrag on a single member,
wait for that member to rejoin/settle, then proceed to the next member to avoid
control-plane disruption.
- Around line 13-18: The fenced log block is missing a language tag which
triggers markdownlint MD040; update the code fence that contains the lines
beginning with "etcdserver: failed to send out heartbeat..." and the subsequent
etcdserver/wal lines by adding a language identifier (e.g., "text") after the
opening ``` so the block reads like ```text to satisfy the linter and preserve
formatting.
- Around line 41-45: The fio benchmark currently uses the --fsync=1 flag which
measures fsync (data+metadata) but the documented SLA and WAL behavior require
measuring fdatasync; update the fio invocation in the docs (the command line
containing fio --name=etcd-io-test ...) to replace --fsync=1 with --fdatasync=1
so the 99th-percentile fdatasync latency threshold (<10 ms) is validated
correctly.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e512c52c-72ca-4013-a6cf-cbe02c53c390

📥 Commits

Reviewing files that changed from the base of the PR and between 7a0b251 and e0e0467.

📒 Files selected for processing (1)

docs/en/solutions/Backend_Performance_Requirements_for_etcd.md

coderabbitai · 2026-04-22T09:03:57Z

+```
+etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for xxx ms)
+etcdserver: server is likely overloaded
+etcdserver: read-only range request "key:\"xxxx\"" count_only:true with result "xxxx" took too long (xxx s) to execute
+wal: sync duration of xxxx s, expected less than 1s
+```


⚠️ Potential issue | 🟡 Minor

Add a language tag to the log code fence.

The fenced block at Line 13 is missing a language identifier, which triggers markdownlint MD040.

✅ Suggested patch

-```
+```text
 etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for xxx ms)
 etcdserver: server is likely overloaded
 etcdserver: read-only range request "key:\"xxxx\"" count_only:true with result "xxxx" took too long (xxx s) to execute
 wal: sync duration of xxxx s, expected less than 1s

🧰 Tools

🪛 markdownlint-cli2 (0.22.0)

[warning] 13-13: Fenced code blocks should have a language specified (MD040, fenced-code-language)
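As a rough local pre-check for this finding, the sketch below lists the line numbers of opening code fences that carry no language tag. It is an approximation, not the markdownlint implementation: the function name `find_untagged_fences` is hypothetical, and it only handles backtick fences that start at column 0.

```bash
#!/usr/bin/env bash
# find_untagged_fences <file>
# Prints the line number of every opening fence with no language identifier.
# Simplification (assumption): fences are three backticks at column 0 and are
# never nested; tilde fences and indented fences are ignored.
find_untagged_fences() {
  awk '
    /^```/ {
      if (!in_block) {
        # Opening fence: report it when nothing but whitespace follows.
        if ($0 ~ /^```[[:space:]]*$/) print NR
        in_block = 1
      } else {
        # Closing fence: leave the block.
        in_block = 0
      }
    }
  ' "$1"
}
```

Running `find_untagged_fences docs/en/solutions/Backend_Performance_Requirements_for_etcd.md` before pushing would be one way to spot such a fence without waiting for the linter.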

coderabbitai · 2026-04-22T09:03:57Z

+fio --name=etcd-io-test --ioengine=sync --bs=4k --numjobs=1 --size=512M \
+    --rw=write --iodepth=1 --fsync=1 --runtime=30 --time_based
+```
+
+The 99th percentile fdatasync latency must be under **10 ms**.


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In fio, what is the behavioral difference between --fsync=1 and --fdatasync=1, and which one best matches etcd WAL fdatasync latency validation guidance?

💡 Result:

In fio, --fsync=1 issues fsync(2) after every block written, syncing both data and metadata to disk. --fdatasync=1 issues fdatasync(2) after every block, syncing only data (not metadata unless required for data retrieval), falling back to fsync on unsupported platforms. --fdatasync=1 best matches etcd WAL fdatasync latency validation guidance. Etcd primarily uses fdatasync for WAL syncs (fsync only on create/close/repair/cut), and official validation uses fio with --fdatasync=1 (e.g., fio --rw=write --ioengine=sync --fdatasync=1 --bs=2300 --size=XXm), checking 99th percentile fdatasync latency <10ms.

Citations:

1: https://www.flamingbytes.com/blog/fio-fsync-end_fsync-fdatasync-and-sync/

2: https://fio.readthedocs.io/en/latest/fio_doc.html

3: discuss: fdatasync over fsync in WAL etcd-io/etcd#18768

4: https://prohoster.info/en/blog/administrirovanie/kak-s-fio-proverit-diski-na-dostatochnuyu-proizvoditelnost-dlya-etcd

5: https://docs.okd.io/latest/etcd/etcd-practices.html

6: https://eng.d2iq.com/blog/etcd-performance-benchmarking/

7: https://etcd.io/docs/v3.4/op-guide/hardware

🏁 Script executed:

fd "Backend_Performance_Requirements_for_etcd.md" --type f

Repository: alauda/knowledge

Length of output: 122

🏁 Script executed:

sed -n '41,45p' docs/en/solutions/Backend_Performance_Requirements_for_etcd.md

Repository: alauda/knowledge

Length of output: 266

Change --fsync=1 to --fdatasync=1 to match the documented threshold.

The fio command uses --fsync=1 (which syncs data and metadata), but the threshold requires fdatasync latency <10ms (which syncs data only). This mismatch means the benchmark measures the wrong primitive. etcd WAL operations primarily use fdatasync, not fsync. Update the command to --fdatasync=1 to correctly validate etcd performance.

Suggested patch

 fio --name=etcd-io-test --ioengine=sync --bs=4k --numjobs=1 --size=512M \
-    --rw=write --iodepth=1 --fsync=1 --runtime=30 --time_based
+    --rw=write --iodepth=1 --fdatasync=1 --runtime=30 --time_based
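To make the 10 ms target concrete, here is a minimal sketch of a pass/fail check on a p99 fdatasync latency expressed in nanoseconds. The function name `check_p99` and the variable `THRESHOLD_NS` are hypothetical. The commented wiring to fio is an assumption as well: it presumes a fio 3.x build with JSON output, `jq` on the PATH, and that the percentile lives at the JSON path shown, which should be confirmed against the installed fio version.

```bash
#!/usr/bin/env bash
# Threshold from the article: p99 fdatasync latency must be under 10 ms.
THRESHOLD_NS=10000000

# check_p99 <latency_ns>
# Prints PASS and returns 0 when the latency is strictly below the threshold,
# otherwise prints FAIL and returns 1.
check_p99() {
  local p99_ns="$1"
  if [ "$p99_ns" -lt "$THRESHOLD_NS" ]; then
    echo "PASS"
  else
    echo "FAIL"
    return 1
  fi
}

# Hypothetical wiring (assumed JSON path, verify before relying on it):
#   p99=$(fio --name=etcd-io-test --ioengine=sync --bs=4k --numjobs=1 \
#           --size=512M --rw=write --iodepth=1 --fdatasync=1 \
#           --runtime=30 --time_based --output-format=json \
#         | jq '.jobs[0].sync.lat_ns.percentile["99.000000"]')
#   check_p99 "$p99"
```

Keeping the comparison in integer nanoseconds avoids floating-point handling in the shell.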


coderabbitai · 2026-04-22T09:03:57Z

+If the database size approaches the quota, perform manual defragmentation:
+
+```bash
+kubectl exec -n kube-system etcd-<node-name> -- etcdctl defrag \
+  --endpoints=https://127.0.0.1:2379 \
+  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
+  --cert=/etc/kubernetes/pki/etcd/server.crt \
+  --key=/etc/kubernetes/pki/etcd/server.key
+```


⚠️ Potential issue | 🟠 Major

Add a defrag safety note (one member at a time).

This runbook should explicitly instruct sequential defragmentation (not all members concurrently) to reduce control-plane disruption risk.

✅ Suggested patch

 ### Database Defragmentation

 If the database size approaches the quota, perform manual defragmentation:
+
+Run defragmentation on **one etcd member at a time** and wait for the member to become healthy before moving to the next member.
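The sequential procedure could be scripted roughly as follows. This is a sketch under assumptions rather than a tested runbook: the function names `run_etcdctl` and `defrag_members`, the `KUBECTL` override, and the retry limits are hypothetical, while the pod naming, endpoint, and certificate paths are copied from the article's command. It relies on `etcdctl endpoint health` exiting non-zero while the member is unhealthy.

```bash
#!/usr/bin/env bash
# run_etcdctl <node-name> <etcdctl args...>
# Runs etcdctl inside the etcd static pod of the given node.
# KUBECTL can be overridden (for example with a stub) for dry runs.
run_etcdctl() {
  local node="$1"; shift
  "${KUBECTL:-kubectl}" exec -n kube-system "etcd-${node}" -- etcdctl "$@" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key
}

# defrag_members <node-name>...
# Defragments one member at a time and waits for it to report healthy
# before moving on. Gives up after 30 failed health checks (assumed limit).
defrag_members() {
  local node tries
  for node in "$@"; do
    echo "defragmenting etcd-${node}"
    run_etcdctl "$node" defrag || return 1
    tries=0
    until run_etcdctl "$node" endpoint health; do
      tries=$((tries + 1))
      if [ "$tries" -ge 30 ]; then
        echo "etcd-${node} did not become healthy" >&2
        return 1
      fi
      sleep 5
    done
  done
}
```

A call such as `defrag_members master-1 master-2 master-3` (node names are placeholders) would then process the members strictly in order and stop at the first one that fails to recover.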


[configure] Backend Performance Requirements for etcd

22a9dad

jing2uo temporarily deployed to translate April 22, 2026 07:08 — with GitHub Actions Inactive

jing2uo requested a review from oilbeater April 22, 2026 07:08

[configure] Backend Performance Requirements for etcd

e0e0467

jing2uo temporarily deployed to translate April 22, 2026 09:00 — with GitHub Actions Inactive

coderabbitai Bot reviewed Apr 22, 2026


jing2uo wants to merge 2 commits into main from kb/2026-04-21/backend-performance-requirements-for-etc
